Support aoti_torch_cuda__weight_int4pack_mm #15030
Conversation
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, by replacing at::Tensor and relevant utility functions with ET equivalents.

Using the Voxtral runner as an example, here is the generated ptd file size and latency with the bfloat16 format:

```
aoti_cuda_blob.ptd: 9.0 GB
Program load latency (ms): 0.054
Method load latency (ms):
  audio_encoder: 1492.989
  token_embedding: 803.561
  text_decoder: 6556.770
Run latency (ms):
  audio_encoder: 76.848
  token_embedding: 6.479
  text_decoder: 149.128
```

With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder parts:

```
aoti_cuda_blob.ptd: 3.7 GB
Program load latency (ms): 0.051
Method load latency (ms):
  audio_encoder: 716.667
  token_embedding: 633.476
  text_decoder: 1840.760
Run latency (ms):
  audio_encoder: 329.274
  token_embedding: 4.285
  text_decoder: 335.590
```
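For orientation, a fallback-op shim of this kind generally follows the pattern below: a C-ABI entry point that validates the tensor handles coming from AOTI-generated code and forwards to the ported kernel. This is a minimal sketch only; the stand-in Tensor type, the impl function name, the error codes, and the output-return convention are assumptions, not the actual code in this PR.

```
// Minimal sketch of an AOTI fallback-op shim (all names are placeholders).
#include <cstdint>

struct Tensor {};  // stand-in for the ExecuTorch tensor type in this sketch

using AOTITorchError = int32_t;
constexpr AOTITorchError kOk = 0;
constexpr AOTITorchError kInvalidArgument = 1;

// Ported kernel entry point (assumed name): out = self @ dequant(mat2), where
// mat2 holds int4-packed weights and qScaleAndZeros holds per-group scales
// and zero points.
AOTITorchError weight_int4pack_mm_cuda_impl(
    const Tensor& self,
    const Tensor& mat2,
    int64_t qGroupSize,
    const Tensor& qScaleAndZeros,
    Tensor* out) {
  (void)self;
  (void)mat2;
  (void)qGroupSize;
  (void)qScaleAndZeros;
  (void)out;
  // The real implementation launches the tinygemm-style CUDA kernel here.
  return kOk;
}

extern "C" AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
    Tensor* mat2,
    int64_t qGroupSize,
    Tensor* qScaleAndZeros,
    Tensor** ret0) {
  // Validate the handles coming from AOTI-generated code, then forward to
  // the ported kernel. How the output tensor is allocated and handed back
  // through ret0 follows the shim's convention and is only sketched here.
  if (self == nullptr || mat2 == nullptr || qScaleAndZeros == nullptr ||
      ret0 == nullptr || *ret0 == nullptr) {
    return kInvalidArgument;
  }
  return weight_int4pack_mm_cuda_impl(
      *self, *mat2, qGroupSize, *qScaleAndZeros, *ret0);
}
```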
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15030
Note: Links to docs will display an error until the docs builds have been completed.
❗ 2 Active SEVs: There are 2 currently active SEVs. If your PR is affected, please view them below.
❌ 3 New Failures, 4 Unrelated Failures, as of commit 7dbdad2 with merge base afd98fe.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed, but likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed, but were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
cuda_shim_cpp_unittest("aoti_torch__reinterpret_tensor") | ||
cuda_shim_cpp_unittest("aoti_torch_copy_") | ||
cuda_shim_cpp_unittest("aoti_torch_cuda_guard") | ||
cuda_shim_cpp_unittest("aoti_torch_cuda__weight_int4pack_mm") |
@larryliu0820, I didn't find a CMakeLists.txt for all these unit tests. I suppose we can only test them in fbcode?
See inline for additional documentation (I used Claude Code to generate the docs).
This is great, thank you!
    ret0 != nullptr,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm failed: ret0 is null");
ET_CHECK_OR_RETURN_ERROR(
qGroupSize == 32 || qGroupSize == 64 || qGroupSize == 128 || qGroupSize == 256,
InvalidArgument,
"aoti_torch_cuda__weight_int4pack_mm: qGroupSize must be 32/64/128/256, got %lld",
static_cast<long long>(qGroupSize));
#endif

AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
should check whether self is bfloat16?
We do have quite a few tensor checks in the actual _weight_int4pack_mm_cuda function, so we don't have to repeat them here?
AOTITorchError aoti_torch_cuda__weight_int4pack_mm(
    Tensor* self,
    Tensor* mat2,
check whether mat2 is int32
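A minimal sketch of what the two suggested dtype checks could look like, in the same style as the qGroupSize check above; the `scalar_type()` accessor and the `executorch::aten::ScalarType` spellings are assumed here, not taken from this PR.

```
// Illustrative dtype validation (a sketch; exact ExecuTorch enum/type
// spellings may differ from what the PR actually uses).
ET_CHECK_OR_RETURN_ERROR(
    self->scalar_type() == executorch::aten::ScalarType::BFloat16,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: self must be bfloat16");
ET_CHECK_OR_RETURN_ERROR(
    mat2->scalar_type() == executorch::aten::ScalarType::Int,
    InvalidArgument,
    "aoti_torch_cuda__weight_int4pack_mm: mat2 must be int32 (packed int4 weights)");
```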
Wait, why is the "Run latency" slower with int4? cc @swolchok
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jerryzh168, I did a quick
This kernel is only optimized for batch size 1; is that what you are testing?
Depends on the hardware, I think. If it has to be A100, then it seems the only other option is the gemlite kernels, which are written in Triton. If H100, we can integrate fbgemm kernels; the effort would be similar to the current PR, I think.
63b8a91 into gh/desertfire/1/base
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #15030 by @desertfire (please use this as the source of truth for the PR details, comments, and reviews)
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/desertfire/1/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/desertfire/1/orig
Differential Revision: [D84395275](https://our.internmc.facebook.com/intern/diff/D84395275)
@diff-train-skip-merge
Co-authored-by: Bin Bao <[email protected]>
Stack from ghstack (oldest at bottom):
Summary: When quantizing a model with 4w_hqq (huggingface/optimum-executorch#164), AOTI-generated code will call aoti_torch_cuda__weight_int4pack_mm as a fallback op. This PR borrows the CUDA implementation of _weight_int4pack_mm_cuda from libtorch, by replacing at::Tensor and relevant utility functions with ET equivalents.
Using the Voxtral runner as an example (tested on H100),
With the bfloat16 format, here is the generated ptd file size and latency.
With `--qlinear 4w_hqq --qlinear_encoder 4w_hqq`, the ptd file size is cut by more than half, with slowdowns in the encoder and decoder parts.

Here is the result with `--qlinear 4w --qlinear_encoder 4w`, where weights are quantized but the linear is done with dequant + fp16_matmul. Compared with 4w_hqq, the generated file size is a bit larger, but the computation is surprisingly faster. Needs more investigation.

Differential Revision: D84395275
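For the follow-up investigation, one simple way to compare the fused int4 path against the dequant + fp16_matmul path is to time each kernel in isolation with CUDA events. Below is a minimal timing-harness sketch; `launch_candidate_kernel` is a placeholder for whichever variant is being measured, not an API from this PR.

```
// Minimal CUDA-event timing harness for comparing kernel variants in
// isolation (a sketch; launch_candidate_kernel is a placeholder).
#include <cstdio>
#include <cuda_runtime.h>

void launch_candidate_kernel() {
  // Placeholder: launch the kernel (or library call) under test here.
}

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm up so one-time costs (module load, autotuning) are excluded.
  for (int i = 0; i < 10; ++i) {
    launch_candidate_kernel();
  }
  cudaDeviceSynchronize();

  const int iters = 100;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    launch_candidate_kernel();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg latency: %.3f ms\n", ms / iters);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```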